Research data warehouse: using electronic health records to conduct population-based observational studies

Abstract Background Electronic health records and many legacy systems contain rich longitudinal data that can be used for research; however, they typically are not readily available. Materials and methods At Kaiser Permanente Southern California (KPSC), a research data warehouse (RDW) has been developed and maintained since the late 1990s and widely extended in 2006, aggregating and standardizing data collected from internal and a few external sources. This article provides a high-level overview of the RDW and discusses challenges common to data warehouses or repositories for research use. To demonstrate the application of the data, we report the volume, patient characteristics, and age-adjusted prevalence of selected medical conditions and utilization rates of selected medical procedures. Results A total of 105 million person-years of health plan enrollment was recorded in the RDW between 1981 and 2018, with most healthcare utilization data available since early or middle 1990s. Among active enrollees on December 31, 2018, 15% were ≥65 years of age, 33.9% were non-Hispanic white, 43.3% Hispanic, 11.0% Asian, and 8.4% African American, and 34.4% of children (2–17 years old) and 72.1% of adults (≥18 years old) were overweight or obese. The age-adjusted prevalence of asthma, atrial fibrillation, diabetes mellitus, hypercholesteremia, and hypertension increased between 2001 and 2018. Hospitalization and Emergency Department (ED) visit rates appeared lower, and office visit rates seemed higher at KPSC compared to the reported US averages. Discussion and conclusion Although the RDW is unique to KPSC, its methodologies and experience may provide useful insights for researchers of other healthcare systems worldwide in the era of big data analysis.


INTRODUCTION
Although electronic systems have been implemented in large health organizations to support patient care, billing, and administration for decades, the Health Information Technology for Economic and Clinical Health (HITECH) Act enacted in 2009 1 accelerated both the pace and the uniformity of data collection, thus making electronic health records (EHR) more attractive resources for observational research. [2][3][4][5][6][7][8] Allowing researchers access to a large number of subjects with longitudinal health care records, so-called "big-data" can provide a quicker, more comprehensive, and cost-effective approach to access individual-level health care records. Notable EHRbased data sources include the Clinical Practice Research Datalink (CPRD), a primary care database that provides researchers around the world the opportunity to access National Health Service data from the UK, 9 and the Veterans Administration's Corporate Data Warehouse (CDW), a repository derived from the Veterans Health Administration's electronic medical records system called Computerized Patient Record System (CPRS)/VistA system. 10 Administrative data collected by healthcare organizations at the time of enrollment and during patient care are only sometimes readily available for research. First, the same data type (eg, hospital admission) may come from multiple data sources with various formats and inconsistent values. Second, the change of source data systems over time may leave the data fragmented. Third, the information collected administratively may be incomplete or contain duplicate entries. Fourth, data quality concerns critical to research usually need to be better documented than is possible with administrative data. Medical insurance claims data is a good example of electronic data that requires tremendous consolidation efforts. Although medical insurance data have been commonly used for research, 11,12 the creation of encounters based on medical claims data has been a challenge, 12 because the data are submitted for reimbursement purposes and thus do not represent the episode of care (eg, multiple claims for one hospital stay). Therefore, developing and maintaining a well-documented large-scale data infrastructure to support various research projects is crucial. This report will describe the contents, development, maintenance methodology, and other aspects of a research data warehouse (RDW) within a large integrated healthcare system, Kaiser Permanente Southern California (KPSC). To demonstrate the application of the data in the RDW and the volume of data that can be used for various population-based research projects, we report descriptively the (1) total lengths of enrollment history, (2) demographic, behavioral, and other characteristics, (3) annual counts and age-adjusted prevalence of common chronic conditions, and (4) annual counts and rates of medical procedures performed in KPSC health plan enrollees.

Population and environment
Founded in 1945, KPSC provides integrated care to 4.8 million health plan enrollees at facilities throughout the Southern California region (Figure 1). At its 15 medical centers and 235 medical offices, some 7800 physicians and 25 000 nurses provide comprehensive care, including hospital admission, ambulatory care, ED, urgent care, office visits, optometry, rehabilitation, home health, and hospice care. Outside providers also render medical care for planned, referred, or emergency services. KPSC members are racially and ethnically diverse and represented 21% of the region's population in 2020. The demographics and socio-economic status, including race and ethnic composition of the enrollees, are representative of those living in the region. 13 This study was approved by the institutional review board of KPSC with a waiver for the requirement to obtain informed consent.

HealthConnect
HealthConnect, customized and branded by Kaiser Permanente, is one of the largest civilian health care systems in the world. This comprehensive system replaced hundreds of legacy systems to integrate clinical records with appointments, registration (check-in), clinical documentation, orders, hospital admission/discharge/transfer/medication administration, and billing. In addition to the information that care providers directly enter, a large amount of data is loaded from other systems (eg, lab results) via an electronic interface. A unique medical record number is assigned at the time of health plan enrollment, and this number remains unchanged if an enrollee leaves and rejoins KPSC. In KPSC, the system was implemented between 2004 and 2008.

Convergent medical terminology
With more than 75 000 concepts, convergent medical terminology (CMT) is a KP-wide solution to provide a consistent structure for medical terminology. 14 The core of CMT comprises SNOMED-Clinical Terms (CT), [15][16][17] Laboratory Logical Observation Identifiers Names and Codes (LOINC), [17][18][19] and First Databank drug terminology. 20,21 When a care provider diagnoses a patient's medical conditions and enters them into KP HealthConnect, they are translated into CMT concepts. Most diagnosis-related concepts can be mapped to the International Classification of Diseases (ICD) codes in the backend database for the purpose of billing, reporting, and research. The same process occurs when a physician orders a medical procedure. A procedure-related concept (eg, lab, radiology) is mapped to a Current Procedural Terminology (CPT) code if such a mapping is feasible. Hospital-based services, including ED visits, are coded by professional coders based on international standard codes, and therefore, there is a lag of several weeks for hospital-based coding to be complete. For claims, the data are usually at least 90% complete within 3 months.

Terminology standards
Terminology standards provide a foundation for clear interoperability and improved efficiency for information exchange. They are necessary for reproducibility and data consistency in collaborations across multi-organizations. A number of national/international and KPSC internal terminology standards are stored in the RDW ( Table 1). Most of the KPSC internal codes can be mapped to national/international codes.

Research data warehouse
The RDW was developed by integrating many data sources from the KPSC legacy and EHR systems as well as external data sources such as the American Community Survey into several different data marts (subsets of the RDW focusing on specific data content areas). The data update frequency varies by data content area with most patient care data updated on a weekly basis. The data content areas, development cycle, platform/storage, and data governance are described below.

Data content areas
The RDW contains a wide range of data content areas (domains). The most commonly used data content areas are shown in Figure 2. The RDW also contains radiology and pathology reports. At this point, digital videos/images in DICOM format and other types of notes, such as progress notes and discharge notes, are not included in RDW.

Development of business rules and data modeling
Rules are defined to satisfy business needs. The extract, transform, load (ETL) process includes determining eligible data sources and records, data elements of interest, how these sources/records/data elements should be integrated and standardized, and how the final results are loaded into relational database tables ( Figure 3). When data are extracted from multiple systems, the definitions, formats, and structures may vary; thus, the process is typically more complicated. In this step, the ETL developers also validate the rules using real data extracted from the sources. For illustration, we described (1) the rules of record consolidation of medical claims submitted by outside medical professionals to convert claims into episodes of care and (2) the rules of record cleaning/standardization of height and weight information in the vital signs data mart in Supplementary File S1.
ETL, integration testing, and analytical data quality check After developing data models, the desired data are extracted, transformed, and loaded from the sources to a staging environment and subsequently to the target tables in a test or development environment according to the business rules and data models defined in the previous steps. Next, unit testing (to ensure that the ELT module correctly handles the target data) and integration testing (to ensure that all new and existing ETL components work together seamlessly) are performed. Test scenarios are prepared, and test cases are created. This testing includes the following tasks.
• Validation of counts between source data and final destination tables. • Comparison of new data with current production data (if applicable). • Validation of the existence of no unexpected duplication of data or orphaned records. • Verification of accepted, default, invalid, or missing values.  The ETL scripts are written by an ETL developer, reviewed by a second developer, and finally the data are reviewed by a quality assurance specialist before they are implemented to ensure that the scripts correctly reflect the specified requirements and perform appropriately.
For minor deployments, once the testing and quality assurance (QA) is complete, the module is deployed into production. This process involves the backup of existing tables. After the first run of the module, a quality assurance analyst then checks the data quality using the approach described in the Section "Data Quality Monitoring" (referred to as "analytical QA"). If an error is identified in the integration testing or analytical QA, the ETL scripts will be modified accordingly, and the code review, testing and analytical QA will be repeated.
With the release of the new databases/tables, a detailed documentation is published online within the KP network. The document includes at least the following sections: overview, data sources, data structures, sample program(s), data usage guideline, and potential data quality issues. For new development or major significant revisions of databases, user training is provided to inform users about the changes and their impact on research studies.
User acceptance testing, deployment, documentation, and user training User acceptance testing (UAT) is performed for major deployments by the intended users/clients of the developed database or table. The UAT testers typically first identify test scenarios and create test cases based on business needs, then execute the test cases against the developed database, and document and share the test results with the developers. If any identified issues are caused by incorrect business rules or errors in ETL codes, the steps described in the previous 2 sections will be repeated. The success of UAT leads to the deployment of the data into the production environment.  Figure 2. Common data marts. *Emergency department visits, inpatient stays, outpatient clinical visits (including urgent care and therapy), outside claims, home healthcare, skilled nursing facility use, and hospice care by both primary and specialty care providers. **Requested through KPSC Cancer Registry. ***Progress notes, ED provider notes, nursing notes, discharge summary notes, procedure notes and other types of notes are available from the reporting environment of HealthConnect.

Data quality monitoring
Monitoring data quality on a timely basis is an integral part of the RDW maintenance. There are a large number of automated scripts that provide email alerts to an on-call team of specialists that indicate whether a specific job has failed (and thus needs to be re-run), whether unexpected values for a particular data element have been detected (eg, a new CPT or LOINC value in the source data), or whether there are unexpected counts in the loaded data (eg, a decrease in counts). In addition, on a routine basis, QA reports are generated and posted to a shared location where users have access. The updated frequencies of the QA reports match the updated frequencies of the data. For example, weekly reports are generated for weekly updated data. These reports are presented in both table and graphic formats to allow for the examination of yearly, monthly, and weekly trends, as well as changes from previous versions of the data (Supplementary File S2).

Platforms and storage
The largest RDW tables are hosted in a high-performance Oracle Exadata environment which leverages a massively parallel grid architecture. This server is located within a KP secure data center. Smaller and less-used tables are currently hosted in an Oracle environment but are being migrated to the above-mentioned Exadata environment. ETL scripts run on a SAS Business Intelligence platform consists of 3 computational servers and 3 metadata servers within a grid environment. With the flexible data management technology, the SAS platform is also used for many data integration processes with the remainder being performed by PL/SQL packages running directly in the Exadata environment. The warehouse architecture and ETL designs were created to keep minimum data transfer across the KP network.

Scientific input
The Data Scientific Oversight Committee, representing diverse perspectives of scientists, statisticians, programmers, and project managers, provides scientific guidance and oversight on RDW development. More specifically, it evaluates proposals for new database/table development and provides recommendations on business rules of data contents and development priorities.

Data governance
The management of the RDW is governed by the existing policies and standard operation procedures within the Department of Research and Evaluation and those enforced by the Kaiser Permanente Information Technology and the Technology Risk Assessment Office. Only people who have a business justification for using the data have access. Sensitive data such as social security numbers are protected using Oracle's Virtual Private Database feature, which are database policies that are automatically enforced by controlling access at the column or row level. Any requests to access sensitive data require 3 levels of approval. The first is by the requester's manager, followed by the data owner, and finally by the data service manager. Due to patient confidentiality, direct access to the RDW is not granted to external researchers; however, access to the data may be permitted through collaboration with KPSC researchers.  Figure 3. Data warehouse architecture.

Change management
Small changes to the data (eg, the introduction of a new data element) may result in a relatively minor hotfix to the ETL which is deployed quickly. Such minor changes are typically identified in one or more of 3 ways: bugs reported by users, automated monitoring alerts (eg, a new value of a key variable), or manually reviewed QA reports. Major changes to the data arising from a major change in the source data system (eg, the retirement of one data system and its replacement with another one) occur less frequently and result in a larger project which is then prioritized based on multiple factors including the urgency (eg, scheduled system change), the complexity of the work, and the available resources. The priorities are often discussed within the Data Scientific Oversight Committee. Addition of new functionalities also follows the same process.
Change requests to the database structure or scripts in production are evaluated by the Change Management Committee. The Committee reviews and approves the requests based on factors such as types of changes (ie, enhancement, defect, database/application, server updates), user impact, complexity of implementation plans including rollback plan in case of failure, outage requirement, priority, and potential risks.

Cost-saving statements
The RDW dramatically simplifies data extraction process. First, it is much more convenient to extract data elements from RDW than from original data sources. As an example, data elements commonly used for research are extracted from more than 150 tables in the source EHR and claims data warehouses, transformed by ETL scripts and loaded into a data mart in RDW with only 14 tables for users to use. Second, the RDW contains external data sources that are both challenging and time-consuming to extract. For example, death and cause of death are derived from vital statistics records through a complicated linkage process. 22,23 Neighborhood level measures (eg, neighborhood deprivation index) are appended after health plan enrollees' addresses are geocoded. 24 Each linkage or geocoding process involves a degree of expert knowledge and takes some 2-3 weeks to perform. Finally, the RDW includes legacy data related to utilization (encounter, procedure, and diagnosis data), labs, pharmacy, radiology, and membership which are generally no longer available outside the RDW. As a concrete example to demonstrate cost saving, one might consider a typical research project that needs to make use of health care utilization, diagnosis, medical procedures, vital signs, pharmacy, labs, membership, demographics, and neighborhood level data (eg, neighborhood deprivation index). Using the RDW, a user would only need to extract data from 12 database tables or views (a view is the result of a prewritten, stored query that are saved within the database users can query much as they would a table) in the RDW. However, if the user were to use the source data, they would need to query more than 200 tables from 3 different databases. He/she would also need to perform complex joins and transformations on these tables which the RDW ETL performs. In addition, he/she would need to link study subjects to vital statistics data as well as to geocode their addresses to enable the usage of communitylevel data. Conservatively, the process of extracting data from the sources would take at least 4 weeks assuming that the user is very familiar with the various source systems and the linkage process.

Number of users
There are currently 114 active users of the RDW measured as individuals with accounts and 85 measured as users who have extracted data from the RDW over the last 12 months. In addition, there are hundreds of passive users who use a webbased application to access partial data in the RDW for the purpose of research feasibility assessment.

Allocation of resources
The allocation of efforts changes overtime as the business needs and the maturity of the RDW increase. On an average, more than half of the efforts are spent on development/ enhancement and implementation of new database tables and/or new data elements development, and about 10-20% of efforts are used for maintenance which includes ETL and data quality monitoring.

Application
To demonstrate the volume and the application of the integrated data in the RDW, we reported (1) 25,26 Enrollees who were not enrolled in the KPSC health plan for at least 11 months within each calendar year were excluded from the analyses of (3) and (4) above. The race/ethnicity information is derived from multiple data sources based on a sophisticated algorithm (including imputation). The information is mainly self-reported since 2009, and the data sources were previously described. 27 The information on smoking status (current, passive, quit, and never) is collected during clinic visits. The descriptions of the selected chronic conditions are listed in the footnote of Table 4 and Supplementary File S3. Diabetes, heart failure, and hypertension were each defined using 2 different algorithms (Supplementary File S3) to demonstrate the impact of the variations on disease prevalence estimates.  The age-adjusted prevalence of the selected medical conditions are shown in Table 4 The age-adjusted annual medical utilization rates, including common medical/surgical procedures, are reported in Table 4. During the study period, the age-adjusted annual hospital admission, ED visit, and office visit rates decreased by 36% (from 99.7 to 64.2/1000 adult enrollees), 30% (from 376.3 to 261.6/1000 adult enrollees), and 13% (from 6.3 to 5.5/adult

DISCUSSION
We created an RDW with integrated information from multiple internal and external data sources. With a total of 105 million person-years of health plan enrollment between 1981 and 2018, the RDW provides individually linked and sourceintegrated administrative and clinical data for researchers to conduct population-based studies. The integrated and standardized data have offered convenience and consistency to thousands of research projects conducted at KPSC since it was first established in the late 1990s. Majority of the studies published by KPSC researchers (eg, more than 650 publications in peer-reviewed journals in 2022) are supported by the RDW.
Data in the RDW serve multiple types of research projects. For epidemiological studies, health plan enrollment information in the RDW provides denominators to estimate prevalence and incidence of health conditions, risk factors, and diseases. Retrospective cohorts based on predefined inclusion/ exclusion criteria can be assembled to evaluate the natural history of disease including treatments and responses through passive follow-up. Patients with prespecified diseases or exposure can be identified for screening and intervention with prospective monitoring. For health outcomes studies, the information can be used to understand practice patterns, define quality measures/outcomes, and evaluate the success of implemented interventions. 28 For clinical informaticists, longitudinal and high-dimensional data can be utilized to develop and validate risk prediction models. For trialists, it is possible that the near real-time data may help identify potentially eligible study participants; however, additional filtering by manual chart review is often needed.
Compared to many large-size integrated clinical data warehouses or repositories within the United States previously reported such as Mayo Enterprise Data Trust developed by Mayo Clinic 29 and the Synthetic Derivative developed by Vanderbilt University Medical Center, 30 the RDW at KPSC is larger in size. The Veterans Health Administration' CDW contains an impressive 9 million Veterans and covers multiple geographic areas 10 ; however, the enrollees do not represent the United States general population because most of them are males, married and non-Hispanic white. 31 Gagalova et al 32 reviewed and compared more than 20 clinical research data warehouses with various sizes, data sources, and architecture models. Some of them contain data types that are not offered by the RDW. For example, the KPSC RDW does not include "omics" data and 2D or 3D images. Future clinical data warehouses may consider these types of data as the expenses for storage decrease.
The low prevalence of certain chronic conditions prior to 2007 could be due to under-coding of these conditions before the EHR was fully implemented at KPSC. Since the EHR was implemented (phased in over a period of 4 years), physicians are required to provide diagnosis codes before they can sign off medication orders, thus resulting in more complete coding of medical conditions. The majority of the age-adjusted prevalence rates of chronic conditions estimated in the current study appeared to be lower compared to those of California residents based on surveys. [33][34][35][36][37] For example, the percent of adults who have been told they currently have asthma was in the range of 7.7-8.8% in California in 2011-2018, according to the Behavioral Risk Factor Surveillance System survey. 33 This could be explained, at least partially, by the fact that self-reported health indicators, such as those derived from the Behavioral Risk Factor Surveillance System and the National Health Interview Survey, tend to be more complete compared to those that are derived from physician diagnoses, because not all patients seek medical care on an annual basis.
The current study defined 3 chronic conditions using different algorithms. For diabetes mellitus, the 2 algorithms resulted in comparable rates; however, the 2 algorithms of heart failure and hypertension yielded substantial differences (eg, rates based on definition 1 being 2 or 3 times larger than those of definition 2). The decision on which algorithm to choose is study specific. One may apply the more specific algorithm (eg, definition 2 for heart failure or hypertension) if the goal is to identify a group of patients with high positive predictive value.
According to the NHIS, 5.8%, 7.7%, and 16.7% of adults aged 18-44, 45-64, and 65 had overnight hospital stays in 2018. 38 40 Given the large volume of data, data quality assessment is typically conducted by researchers specialized in specific medical area(s) with data elements commonly used for the research of these medical fields. For example, the accuracy of reporting maternal and fetal clinical diagnosis and procedural coding was validated by a team of perinatology researchers at KPSC. 41 Data quality assessment domains and the assessment approach for each domain are described by Feder 42 and Weiskopf et al 8 A pragmatic framework was proposed to assess data quality in EHR-based clinical research for both single and multi-site studies. 43 Data quality frameworks have been developed to clean data extracted from the EHR. 44 Efforts are needed within each research project to validate and clean the project-specific data and to report any relevant quality issues.
There are apparent challenges when EHR data are being used for research. Because EHR data are not collected specifically to support research, they may bear the following weaknesses pertaining to data quality. First, coding for certain diseases provided by physicians may be incomplete or inaccurate. [45][46][47] The level of incompleteness increases if the codes are not used for billing purposes. The problem is more severe for diseases defined by signs and symptoms. In 2 studies conducted at KPSC, Zheng et al 48 and Yu et al 49 revealed that using ICD-based codes alone to identify local reaction and anaphylaxis in vaccine safety studies could result in incomplete results because approximately 47% and 50% additional local reaction and anaphylaxis cases, respectively, can be identified at KPSC, when natural language processing of clinical notes was applied. In addition, the documentation in clinical notes could also be incomplete and inaccurate. 50 Second, certain data terminologies and codes in EHRs lack standardization and, thus, cannot be used conveniently for research. The development of the RDW is a process of standardization in which the codes/values from multiple data sources are consolidated based on the clinical meaning of the codes/values. Pathak et al provided a framework in the SHARPn project to transform heterogeneous structured and unstructured data (eg, clinical notes) into a uniform and standardized infrastructure. 51,52 More standards are expected to be implemented in commercial EHR software in the future. Third, a lack of consistent standardization across different organizations could occur due to different practice patterns and different terminology standards for services or products (even for the same EHR system/module). To provide a shared data structure to support multi-site studies, various types of research networks have implemented common data models. [53][54][55][56] The adoption of common data models has advantages. For example, they provide fast access, large statistical power, and transparent duplication of analyses in multiple databases. However, information mapping from the source databases to the common data models could be incomplete. 57 Multi-site validation is encouraged to improve the algorithm accuracy, as demonstrated in the eMERGE network in which thirteen EHR-based phenotype algorithms, including dementia, type 2 diabetes, and height were created and validated. 58 In addition to the challenges mentioned above, the KPSC RDW has several limitations. First, the data in the warehouse is not real-time, thus studies that require real-time data need to extract the data directly from the original sources. Second, the availability of historical data depends on when the legacy systems were implemented. Third, the quality of the data varies depending on the clinical workflow and specialty area. For example, the slight decline in the prevalence of hypertension by definition 2 (defined by encounter diagnosis codes and hypertensive medications) could be due to the decrease in coding in recent years.
Despite the limitations mentioned above, the RDW at KPSC is advantageous compared to data derived solely from medical claims, for several reasons. First, medical records are available to conduct chart review for data/algorithm validation. Second, upcoding, which occurs when more serious conditions are coded and submitted than actually observed, an issue known for medical claims data, is unlikely to happen within KPSC because the service is prepaid for the vast majority of the patients. Third, clinical guidelines are strictly followed by KPSC providers to make clinical decisions/ recommendations due to decision support tools and built-in system controls. These national or internal guidelines are developed based on medical research and are evidence-based. Therefore, medical practice across KP hospitals and medical offices could be more consistent. Finally, physician-and facility-level data are available to assess the impact of these characteristics on the outcomes of interest in addition to patient-level characteristics.

CONCLUSION
The RDW provides a useful resource for the conduct of observational research based on EHR data, making the clinically and administratively collected data readily available for researchers. More comprehensive data quality assessment tools can be implemented to provide high-quality data for various types of research. More validation is encouraged to demonstrate the validity and report potential quality issues.

DATA AVAILABILITY
The data underlying this article cannot be shared publicly due to the privacy of individuals that are included in the study.